
checkstate: make tests more robust #249

Merged: 3 commits merged into canonical:master, Jun 28, 2023

Conversation

@flotter (Contributor) commented Jun 27, 2023

checkstate: make check runners exit cleanly

The checks manager uses contexts to cancel check runners. This happens
during checkstate.PlanChanged() when the previous checkers are all
terminated, and the new checker definition is applied.

However, context cancellation is asynchronous (as highlighted in the
context documentation), so the check runners may still be running even
after the cancellation calls have returned.

Because of this race condition, a runner can still launch a check after
the context cancellation has completed.
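
For illustration, here is a minimal, self-contained Go sketch (not the Pebble code itself) of why cancellation alone is not enough: cancel() returns immediately, while the runner goroutine only notices cancellation on its next select, so it can still start one more check unless the caller explicitly waits for the goroutine to exit.

    package main

    import (
        "context"
        "fmt"
        "time"
    )

    func main() {
        ctx, cancel := context.WithCancel(context.Background())
        done := make(chan struct{})

        go func() {
            defer close(done)
            for {
                select {
                case <-ctx.Done():
                    return
                default:
                    // A check can start here even after cancel() has returned,
                    // because cancellation is only observed on the next select.
                    fmt.Println("running check")
                    time.Sleep(10 * time.Millisecond)
                }
            }
        }()

        cancel() // returns immediately; the goroutine may still be mid-check
        <-done   // only this wait guarantees the runner has fully exited
    }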

This problem is reproduced here:
https://gist.github.com/flotter/f6bf32d7cc33cb1d53a3ddc39fc3b64c

The result of this is easily seen in the following unit test failure:

panic: internal error: reaper must be started

goroutine 75 [running]:
github.com/canonical/pebble/internals/reaper.StartCommand
/home/flotter/pebble/internals/reaper/reaper.go:164
github.com/canonical/pebble/internals/overlord/checkstate.(*execChecker).check()
/home/flotter/pebble/internals/overlord/checkstate/checkers.go:170
github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).runCheck()
/home/flotter/pebble/internals/overlord/checkstate/manager.go:199
github.com/canonical/pebble/internals/overlord/checkstate.(*checkData).loop()
/home/flotter/pebble/internals/overlord/checkstate/manager.go:182
created by github.com/canonical/pebble/internals/overlord/checkstate.(*CheckManager).PlanChanged
/home/flotter/pebble/internals/overlord/checkstate/manager.go:74

In this particular case, the problem is that test teardown stops the reaper
process which is responsible for reaping check commands, while the final
(edge case) run of the check is still completing.

By the same logic, checks from the previous plan could overlap with the new
checks started in checkstate.PlanChanged().

The following changes fix the problem:

  • Use a WaitGroup to represent the group of active checks in the check manager

  • Wait() on the group to complete before creating the next group of checks
    when PlanChanged is called.

  • Each check runner will use a deferred exit to decrement the group counter.
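
For illustration, the pattern described by these changes could look roughly like the following self-contained sketch (the names manager, planChanged, runner and numChecks are illustrative, not the actual Pebble types):

    package main

    import (
        "context"
        "sync"
        "time"
    )

    type manager struct {
        wg      sync.WaitGroup
        cancels []context.CancelFunc
    }

    // planChanged cancels the previous group of runners, waits for all of
    // them to exit, and only then starts the new group, so old and new
    // checks can never overlap.
    func (m *manager) planChanged(numChecks int) {
        for _, cancel := range m.cancels {
            cancel()
        }
        m.wg.Wait() // block until every previous runner has returned
        m.cancels = nil

        for i := 0; i < numChecks; i++ {
            ctx, cancel := context.WithCancel(context.Background())
            m.cancels = append(m.cancels, cancel)
            m.wg.Add(1)
            go func() {
                defer m.wg.Done() // deferred exit decrements the group counter
                runner(ctx)
            }()
        }
    }

    // runner performs one check iteration per tick until its context is cancelled.
    func runner(ctx context.Context) {
        ticker := time.NewTicker(10 * time.Millisecond)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                // run one check iteration here
            }
        }
    }

    func main() {
        m := &manager{}
        m.planChanged(2) // start two runners
        time.Sleep(50 * time.Millisecond)
        m.planChanged(1) // waits for the first group to exit, then starts one
        time.Sleep(50 * time.Millisecond)
    }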

checkstate: make tests more robust

The following changes are made:

  • Remove a check in TestTimeout which depends on the check output, as
    there is no way, without significant rework, to remove the race between
    'echo FOO' completing and the timeout being triggered.

    This failure can easily be reproduced by running the test under
    heavy CPU load (e.g. stress --cpu 10 --io 4):

    manager_test.go:140:
    c.Assert(check.ErrorDetails, Equals, "FOO")
    ... obtained string = ""
    ... expected string = "FOO"

  • Make waitCheck() more conservative when waiting for a check
    runner iteration to complete. The worst-case timeout can safely be
    much longer, as it does not generally affect test duration.

    This failure can easily be reproduced by running the test under
    heavy CPU load (e.g. stress --cpu 10 --io 4):

    manager_test.go:134:
    check := waitCheck(c, mgr, "chk1", func(check *CheckInfo) bool {
    return check.Status != CheckStatusUp
    })
    manager_test.go:283:
    c.Fatalf("timed out waiting for check %q", name)
    ... Error: timed out waiting for check "chk1"
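
For illustration, a more conservative waitCheck could look roughly like this. The signature and the Fatalf message come from the failure output above; the 10-second worst-case timeout, the mgr.Checks() accessor and the check.Name field are assumptions for the sketch, not necessarily what the PR uses.

    func waitCheck(c *C, mgr *CheckManager, name string, f func(check *CheckInfo) bool) *CheckInfo {
        // Poll frequently so tests stay fast in the common case, but allow a
        // long worst-case timeout so heavy CPU load cannot cause spurious failures.
        timeout := time.After(10 * time.Second)
        for {
            select {
            case <-timeout:
                c.Fatalf("timed out waiting for check %q", name)
                return nil
            case <-time.After(time.Millisecond):
                checks, err := mgr.Checks()
                c.Assert(err, IsNil)
                for _, check := range checks {
                    if check.Name == name && f(check) {
                        return check
                    }
                }
            }
        }
    }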

@@ -45,12 +45,14 @@ func (s *ManagerSuite) SetUpSuite(c *C) {
setLoggerOnce.Do(func() {
logger.SetLogger(logger.New(os.Stderr, "[test] "))
})
}
@flotter (Contributor, Author) commented:

Moved reaper.Start() and reaper.Stop() to test level. The command cleanup should happen before the next test starts, and it also provides a more robust test environment for race conditions, as now each test must cleanly exit before the reaper is stopped.

Note that the reaper is bound to the package-level test binary process ID, and therefore moving it to the per-test setup and teardown makes it incompatible with parallel testing. However, parallel testing is strictly opt-in, so this is still acceptable for this package's tests.
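
For illustration, the per-test setup and teardown described here could look roughly like this (a sketch only; the exact signatures and error handling of reaper.Start() and reaper.Stop() in the real test suite are assumed):

    func (s *ManagerSuite) SetUpTest(c *C) {
        err := reaper.Start()
        c.Assert(err, IsNil)
    }

    func (s *ManagerSuite) TearDownTest(c *C) {
        // Each test must have stopped its checks cleanly by this point,
        // otherwise a late check run would panic with "reaper must be started".
        err := reaper.Stop()
        c.Assert(err, IsNil)
    }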

@@ -161,17 +162,15 @@ func (s *ManagerSuite) TestCheckCanceled(c *C) {
},
})

- // Wait for command to start (output file grows in size)
- prevSize := 0
+ // Wait for command to start (output file is not zero in size)
@flotter (Contributor, Author) commented Jun 27, 2023:

Previous logic did not make 100% sense, so I simplified it a bit.

c.MkDir() is test-managed and removed after the test. The temp file therefore always starts at zero size, so we can simply check for a non-zero value. In the original code, the prevSize = len(b) assignment appeared redundant, as the preceding check would break out of the loop before it could run for any non-zero value.
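
For illustration, the simplified wait could look roughly like this (a sketch, where tempFile is assumed to be the check's output path under the test-managed c.MkDir() directory):

    // Wait for command to start (output file is not zero in size)
    for i := 0; ; i++ {
        if i >= 100 {
            c.Fatalf("timed out waiting for command to start")
        }
        b, err := os.ReadFile(tempFile)
        if err == nil && len(b) > 0 {
            break // the check command has started writing output
        }
        time.Sleep(time.Millisecond)
    }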

@benhoyt (Contributor) left a comment:

Thank you! This is really good work and detailed analysis. Much appreciated!

In terms of review: approved, but with a couple of very small suggestions.

(Review suggestions on internals/overlord/checkstate/manager.go and internals/overlord/checkstate/manager_test.go were addressed and resolved.)
@anpep (Collaborator) left a comment:

At last! Thanks for the great work

@flotter removed the request for review from barrettj12, Jun 28, 2023 12:40
@flotter added the "High Priority Look at me first" label, Jun 28, 2023
@@ -137,7 +139,6 @@ func (s *ManagerSuite) TestTimeout(c *C) {
c.Assert(check.Failures, Equals, 1)
c.Assert(check.Threshold, Equals, 1)
c.Assert(check.LastError, Equals, "exec check timed out")
- c.Assert(check.ErrorDetails, Equals, "FOO")
@flotter (Contributor, Author) commented:

As explained in the commit message, this check is too racy: there is no guarantee the output would be written before the timeout cuts off the command execution. Given that this test focuses on the timeout mechanism and not on command output logging, I feel removing it is justified to improve test robustness.

@@ -185,7 +184,6 @@ func (s *ManagerSuite) TestCheckCanceled(c *C) {
stopChecks(c, mgr)

// Ensure command was terminated (output file didn't grow in size)
- time.Sleep(50 * time.Millisecond)
@flotter (Contributor, Author) commented:

stopChecks() calls PlanChanged(), which terminates the previous checks synchronously now, so this is not needed.

@benhoyt (Contributor) commented Jun 28, 2023

Small stability improvements, no API changes -- going to merge this without further review. We can always tweak in subsequent PRs if needed.

@benhoyt benhoyt merged commit 6892f25 into canonical:master Jun 28, 2023
@barrettj12 (Contributor) commented:

@flotter I'm seeing a similar intermittent failure in the servstate tests - see #224. Maybe a similar fix could help here too?
